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Abstract 


A new solver featuring time-space adaptation and error control has been recently introduced 
to tackle the numerical solution of stiff reaction-diffusion systems. Based on operator splitting, 
finite volume adaptive multiresolution and high order time integrators with specific stability 
properties for each operator, this strategy yields high computational efficiency for large multi¬ 
dimensional computations on standard architectures such as powerful workstations. However, 
the data structure of the original implementation, based on trees of pointers, provides limited 
opportunities for efficiency enhancements, while posing serious challenges in terms of parallel 
programming and load balancing. The present contribution proposes a new implementation of 
the whole set of numerical methods including RadauS and ROCK4, relying on a fully different 
data structure together with the use of a specific library, TBB, for shared-memory, task-based 
parallelism with work-stealing. The performance of our implementation is assessed in a series 
of test-cases of increasing difficulty in two and three dimensions on multi-core and many-core 
architectures, demonstrating high scalability. 

Keywords: Task-based parallelism, multi-core architectures, multiresolution, adaptive grid, stiff 
reaction-diffusion equations 

AMS subject classifications: 65Y05, 65T60, 65M50, 65L04, 35K57 

1 Introduction 

Stiff reaction-diffusion systems model various complex phenomena across different disciplines such 
as combustion, atmospheric sciences, plasma physics or biomedical engineering. Such models can 
also involve the dynamics of moving fronts, usually very localized in space (see, e.g., |23j and 
references therein), and potentially entail a large number of unknowns. A general reaction-diffusion 
system can be written as follows, for j = 1, 2,..., to: 
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with the compact notation: u = (ui ,..., Um)*- In particular we denote fiu) = (/i(u),..., /^(u))*. 
With no loss of generality we restrict our presentation to homogeneous Neumann boundary condi¬ 
tions. 

Two major difficulties need to be addressed when solving numerically this kind of problem 
in two and three dimensions. First, a large spectrum of temporal scales in the nonlinear source 
terms yields highly stiff equation^ Systems of stiff ordinary differential equations impose the use 
of numerical methods with specific stability properties in order to achieve accuracy and stability 
within reasonable memory and computing costs |36j . Secondly, steep fronts require a very fine 
discretization mesh, at least locally, which leads to problems of large size if no mesh adaptation is 
used. Additionally, spatial stiffness may arise as a consequence of these steep spatial gradients even 
with non-stiff source terms and diffusion coefficients of relatively small value m- 

We have recently introduced in m a tailored numerical strategy to cope with the latter difficul¬ 
ties using reasonable computing resources, that is, on a sufficiently powerful workstation, possibly 
exploiting shared-memory parallelism |26j . It relies on time operator splitting with one-step, 
high order integration schemes, namely, RadauiQ [36] and Rock^ [T] for the time integration 
of the reaction and diffusion sub-systems, respectively. The discretized equations are solved on a 
dynamically adapted grid generated by multiresolution analysis (MRA) in a finite volume multi¬ 
dimensional framework in the spirit of the original work of Harten |S7j and then Cohen et al. and 
Muller [ini ESI |S7| . MRA is based on a wavelet decomposition and a multiresolution transform, 
yielding both highly compressed representations for problems displaying localized fronts as well as 
a compression error control with respect to the solution on the full grid. Let us emphasize that 
this property is a major difference with an AMR (Adaptive Mesh Refinement) strategy where the 
refinement criterion rather relies on heuristics. This MRA-operator splitting numerical strategy was 
implemented in the MB ARETE code [l^, using a tree-structured data with pointers and recur¬ 
sive navigation. However, the latter implementation shows serious limitations in terms of parallel 
programming due to the lack of data locality and load balancing. A new paradigm of parallelism 
together with a different and customized data structure is thus needed to achieve more efficient 
MRA-operator splitting implementations. 

A wide spectrum of techniques and runtime implementations are available for application de¬ 
velopers to express parallelism. For the problems we are interested in and using MRA, one single 
modern compute node provides enough memory and computing power to run simulations with a 
reasonable time-to-solution. Therefore, we have so far focused on parallelism over shared-memory 
architectures as it provides a number of advantages well-adapted to MRA applications. Contrary 
to uniform Cartesian grids for which arrays can be generally accessed following regular patterns 
within long loops, adaptive meshing in general, that is both multiresolution and AMR codes, relies 
on fine-grained and dynamic data structures. The corresponding algorithms can thus have intricate 
data dependencies, especially for complex operations such as those associated with mesh adapta¬ 
tion. Hence, exposing parallelism in these methods is best done using programming techniques 
that combine both coarse- and fine-grained parallel constructs, keeping in mind that maximizing 
parallel coverage is crucial to limiting the impact of Amdahl’s law on scalability. As a consequence 
of the complex layouts of the resulting adapted meshes, an efficient parallel implementation requires 
non-trivial balancing of the computations across all available computing cores. Moreover, the load 
balancing needs to be frequently updated as the mesh structure evolves during the simulation. 
Actually, many existing mesh refinement algorithms and libraries address this issue in distributed- 
memory settings for both patch- and cell-based AMR (TT] [501 HI 11311331 IMl |001 [TOl jO) [TH E31 HU HO]. 

^The latter can be expected when the Jacobian matrices, have eigenvalues whose real part 

varies within a large negative interval or in the presence of strong attracting slow manifolds m- 

^High performance computing can be achieved by choosing dedicated schemes for each split sub-system [211 Eg. 
Operator splitting schemes also exhibit a large data-driven parallelism, and their mathematical analysis is well- 
established for relatively large splitting time steps even when stiffness is present [1911171 fT^ . 

^An implicit, fifth order Runge-Kutta scheme with A- and L-stability properties and time step adaptation, 
stabilized, explicit Runge-Kutta method of order four and time step adaptation. 
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However, even if MRA leads to compression error control, the mesh adaptation relies on a mul¬ 
tiresolution transform that involves recursive navigations throughout the data sets [7]. The latter 
makes it much more difficult to parallelize than classic AMR approaches. Some advances have been 
conducted recently into that direction but in a different framework EH 01117]. 

In shared-memory architectures, the common memory address space enables dynamic load bal¬ 
ancing while avoiding data transfers, but it also requires careful synchronizations between threads. 
Starting from the original implementations on a single computer El 01, several authors have pro¬ 
posed multithreaded implementations for adaptive gridding techniques imn] with some focusing 
on specialized grid structures for multi-core processors [H [SS] . Task-based parallelism provides an 
attractive shared-memory solution to both granularity and dynamic load balancing requirements. 
In a task-based approach the programmer introduces parallelism by specifying computations that 
can be carried out in parallel. Expressing tasks can be done using different techniques, for example, 
relying on compiler directives or through library calls. Scheduling of the tasks is determined at 
runtime based on the available computing resources, yielding dynamic load balancing using tech¬ 
niques such as work-stealing Em. A key feature associated with tasks is that parallelism is not 
limited to flat parallel iteration constructs, but can be introduced recursively by having tasks create 
other tasks. This makes the task concept particularly suited to codes with complex hierarchical 
operations and recursive navigation in the data set, such as MRA applications. 

We have thus opted for a shared-memory parallel approach based on work-stealing, relying 
on the Intel Threading Building Blocks (TBB) library [521 IMl- While other runtime libraries 
supporting task parallelism are currently available, such as the OpenMP application programming 
interface (API), and could have been used for our purposes with equivalent functionality, they would 
have required a much more important programming effort. Shared-memory parallelism was hence 
introduced using the task-based Intel Threading Building Blocks runtime library. The latter turns 
out to be particularly well suited to the multiresolution algorithm in order to cope with nested 
parallel regions, which are directly related to the MRA-splitting strategy. 

Additionally, a completely new implementation of the original numerical strategy of MBARETE 
code has been conducted, where a different and more efficient data structure has been introduced 
as an alternative to the original pointer-based approach. Important performance enhancements due 
to work-stealing techniques and in particular the use of TBB have already been shown in EH [51] 
for multiresolution schemes applied to non-stiff time-dependent PDEs, that is, using explicit time 
integration scheme^ 

In order to assess the new implementation and code performance we have chosen three reaction- 
diffusion models with increasing complexity in two {d = 2) and three {d = 3) dimensions: 

I. NAGUMO. A bistable, Nagumo-type reaction-diffusion equatiorj^ Here we have m = I and 
/i(wi) = kuf{l — Ui) in Q. We consider k = 10, Si = 0.1, and an initial condition satisfying 
0 < ^^(a;) < 1 every wher^ 

H. BZ. The Belusov-Zhabotinsky reaction 05] 00]. This is a system of three (m = 3) equa¬ 
tions, where the reaction term is given bjj^ /i(ui, U 2 , ua) = 10^ (—0.02 ui — U 2 U 1 -\- 1.6 U 3 ), 
/ 2 (mi,U 2 ,U 3 ) = 10^ (u 2 — W 2 — Ui{u 2 — 0.02)), and / 3 (ui,U 2 ,M 3 ) = U 2 — U 3 . For the diffusion 
part, we set ei(a:) = £2(3;) = 2.5 x 10“^ and £3(3:) = 1.5 x 10“^ in ([T]j^ 

^These authors developed a data structure based on the definition of wavelet blocks, specifically a tree of wavelet 
blocks, where each block contains a predefined number of cells at the same grid-level. 

bistable case of type A according to EH, which is a limit case of the Nagumo function, similar to what is 
found in combustion models m or in nonlinear chemical dynamics when studying traveling waves due, for instance, 
to chemical reactions with cubic auto-catalysis m- 

^In general the reaction term is not stiff as \dfi{ui)/dui\ < 3k. However, with this setting, traveling waves develop 
with a sharp spatial gradient yielding a space multi-scale configuration. 

®This set describes a chemical reaction between HBr02, Ce{IV) and Br~ , also known as the Oregonator problem. 

^The reaction term is stiff. The system of ordinary differential equations has a limit cycle along which the 
amplitude of the eigenvalues of the Jacobian attains values of the order of 10^. As a comparison, when computing in 
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III. STROKE. An ischemic stroke model pTl I16| . This is a system of 21 (m = 21) equations 
with a very stiff reaction terirp^ Its computation is performed by a quite complex and 
computationally expensive program of about 400 instructions, containing many log-function 
evaluations. 

The paper is organized as follows. After a brief description of the key aspects of the numerical 
strategy in Section we focus in Section on the task-based parallel implementation developed 
in this work. A performance analysis is then presented in Section]^ to end with some concluding 
remarks and future developments related to this work. 


2 Numerical strategy 

We are interested in two- and three-dimensional simulations. Discretizing 0 in space yields a large 
and stiff system of ordinary differential equations given by 

^=A,D + E(C/), (2) 

where A^ is a matrix corresponding to the discretization of the diffusion operator. Using a standard 
centered second-order discretization on a Cartesian mesh, A^ is given by a matrix with five (resp., 
seven) non-zero elements per line in two (resp., three) dimensions. Following |27j a dedicated 
splitting solver, briefly described in what follows, is implemented to advance system ([^ in time. 


2.1 Dedicated splitting solver 

Considering independently the two sub-problems coming from ([^; 




'o 

II 

VF(0) = Wo, 


we denote by DAtVo and RAtWo the solution of the first and second sub-problem, respectively, 
after a given splitting time step At. Assuming in general stiff reaction terms, we consider without 
any loss of generality the second-order Strang schemtp] [SU] . Un+i — RAt /2 ° -^At o RAt/ 2 Un- 

The RadauS method is used to solve each of the independent initial value problems coming 
from the reaction term and defined at each grid-point, during the Reaction Steps, RAt/ 2 ] whereas 
the Diffusion Step, Da*, uses the explicit ROCK4 method to globally advance the discrete diffusion 
sub-problem. Notice that with the latter choice of solver, one only needs to evaluate matrix-vector 
products without any need of preconditioning nor factorizing the diffusion matrices when the mesh 
has changed. In this work we have chosen splitting time steps, At, equal to 10“^, 10“^, and 1 for 
the NAGUMO, BZ, and STROKE models, respectively. 

Finally, in terms of parallelism the i?At/2"Step exhibits a high data-driven parallelism, for one 
has as many independent problems as nodes in the spatial discretization. The situation is different 

fl = [0,1]'^ with a spatial discretization step of h = 1/1024, the largest negative eigenvalues in the diffusion term are 
about —2 X 10'^. Propagating fronts with steep spatial gradients are also developed in this case yielding a time-space 
multi-scale configuration. 

^'^The right-hand side, f{u), incorporates an important amount of biological knowledge about ion channels (see m 
for the model here considered). A numerical computation of the eigenvalues of the Jacobian near a stable equilibrium 
state, f{u) = 0, shows real parts in the interval of [—10®,0[. Stiffness is due in part to the fact that f{u) models 
voltage-gated ion channels that open or close when the difference of potential between a cell and the surrounding 
media attains a given threshold. Gates are modeled through sigmoid functions, closely approximating a Heaviside 
function. Simulations show steep spatial gradients, as well. 

^^It has been shown that better accuracies are expected by ending the splitting scheme with the substep involving 
the fastest time scales |181ll9llT5] . Both theoretical and numerical results show that this method performs very well 
for stiff reaction-diffusion systems |23l 1271128| . 
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for the -DAfStep, which involves the whole computational domain. A simple and straightforward 
parallelism consists in parallelizing the solution of the m independent parabolic linear equations 
in Q [5^. However, an alternative that is not limited by the number of equations should be 
considered to enhance the parallel performance of the diffusion solver. The latter is much simpler 
to achieve in the case of an explicit scheme like ROCK4, since matrix-vector products are easier 
to parallelize than linear solvers. 

2.2 Adaptive multiresolution method 

For the kind of problems we are interested in, featuring propagating localized fronts, an adaptive 
meshing technique that reduces the number of grid-nodes over near-equilibrium zones can signifi¬ 
cantly reduce the CPU timj^ together with the required computer memory. In the following we 
describe some basic features of the multiresolution method introduced in |10j . and implemented in 
[27] with the aforementioned dedicated splitting solver for reaction-diffusion problemj^ 



Figure 1: One-dimensional graded tree composed of nested grids. The resulting adapted mesh is 
given by the leaves of the tree, indicated in bold. 


Without loss of generality we consider the computational domain O = [0,1]^^, d = 1,2,3. A 
recursive dyadic subdivision of 11 is then performed, yielding 2‘^1 cells (segments, squares or cubes) 
of equal size at each grid-level j, from j = 1 to j = J. Levels 0 and J correspond, respectively, 
to the whole computational domain LI and the mesh at the finest spatial resolution. An adapted 
spatial representation can be thus achieved by trimming the set of nested grids throughout the 
tree, as shown in Figure for a one-dimensional case. In particular the adapted grid is given by 
the leaves of the tree. Finite volumes are associated with all cells of the tree, and throughout 
this paper we will refer to all cells, including the leaves, and their corresponding finite volumes as 
nodes. The root of the tree is the entire domain Ll. Even though the solution of problem Q is 
only performed on the adapted mesh, that is on the leaf nodes, the solution is updated and stored 
at all nodes of the tree. The set of values at nodes of level j is denoted as Uj. Data at different 
levels of discretization are related by two inter-level transformations defined by the projection and 
prediction operators. The projection operator, Pj_i, which maps Uj to U^-i, is obtained through 
exact averages computed at the finer level; that is, for a given node at level j — I one computes the 
average of the 2^ nested nodes at the successive finer level j. The prediction operator, Pj~^, maps 
Uj-i to an approximation JJj of U^-; that is, the value of a given node at level j is approximated 
based on the surrounding coarser nodes at level j — 1. 

Grid adaptation is performed based on local estimators of the spatial regularity of the solution 
at a given simulation time. These local estimators are known as details., and for a given node at 
level j its detail is defined as 

^j,k ~ ^j,k ~ ^j,k Pj ^ Pj—i'^j,k: (3) 

^^Since this is a key aspect, a numerical experimentation on uniform meshes is provided in Appendix [^ for the 
three aforementioned models. 

^®More details on this particular multiresolution implementation can be found in m- 
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where k accounts for the location of the node at level j, and Uj^k represents the cell-average of 
u{x, t) there. From the theoretical point of view, the details correspond to the coefficients related 
to a given discrete function when represented on a wavelet basis. This wavelet decomposition is a 
general procedure, independent of any physical particularity of the problem. Introducing a tolerance 
parameter, e > 0, threshold values are defined level-wise as Cj = 25 j = 0,1,..., J. Data 
compression is thus achieved by discarding nodes whose details are smaller than Cj in a given norm; 
here we consider an L 2 -iiorm |23| . Conversely, nodes whose details exceed Cj must be refined. Grid 
adaptation is hence driven by the local value of the details, while the tolerance parameter defines the 
level of approximation errors introduced by the data compression m- This is the main difference 
between MRA and AMR strategies. In particular, while enabling compression error control, the 
multiresolution analysis also leads to recursive navigation in the data set. 

In this work the approximated values, tJj, generated with the prediction operator, are obtained 
using centered polynomial interpolations of accuracy order, o = 2Z -|-1, computed with the I nearest 
neighboring cells in each direction including the diagonals in multi-dimensional configurations. For 
a one-dimensional configuration with 1 = 1, the prediction operator is then given by |10j 


Uj + l,2k = Uj^k + n (“t.fe-l ~ Uj^k+l), Uj + l,2k-\-l = Uj,k + X (Wj,fc+1 — Uj^k-l)- 


(4) 


Extension to multi-dimensional Cartesian grids is easily obtained by a tensor product of the one¬ 
dimensional operator Q (see, e.g., [S5]). The interpolation stencil is thus given by {21 -I l)'^ cells at 
the coarser level. Here we will only consider third-order interpolations with 1 = 1. Notice that these 
local interpolation stencils must be available throughout the tree-structure during the simulation. 
A tree-structure that complies with such a constraint is called a graded tree nnj. Hence, a graded 
tree-structure must be always guaranteed after the refining and coarsening operations. 


3 Task-based parallelism for MRA-splitting applications 

We have introduced shared-memory parallelism using the task-based Intel Threading Building 
Blocks runtime library [52l |39]. It turns out to be particularly well suited to the multiresolu¬ 
tion algorithm in order to cope with nested parallel regions, which are directly related to the 
MRA-splitting strategy. Other than TBB, several runtime libraries supporting task parallelism are 
currently available, such as the OpenMP API m, XKaapi [35], among others. They could po¬ 
tentially have been used for our purposes with equivalent functionality, but requiring a much more 
important programming effort. A discussion clarifying the reasons why we have opted for TBB, in 
particular over OpenMP API, is provided in Appendix |B| 

3.1 Parallel implementation of the multiresolution algorithm 

One of the main constraints on parallel implementations of the multiresolution algorithm has to do 
with the data locality required to efficiently perform the inter-level operations throughout the tree 
data structure (quadtrees or octrees in two or three dimensions, respectively). More specifically, 
the prediction operator needs to access all nodes in the interpolation stencil, meaning that for any 
given node, one must access the parent-node at the successive coarser level and all its neighboring 
nodes, as shown in Figure in the case of quadtrees. On the other hand, the projection operator 
needs to access the child-nodes of any given node that is not a leaf. Notice that it is very unlikely to 
achieve data locality in standard pointer-based tree data structures, like the one considered in the 
MBARETE code. In the latter case nodes are generated and linked recursively and even though 
pointers to the neighboring nodes are introduced |23| , nodes belonging to a given interpolation sten¬ 
cil will hardly be stored contiguously in memory. Additionally, other the projection and prediction 
operators, the refinement and coarsening operations, also performed on trees, should be executed in 
parallel as well, in order to maximize parallel coverage and minimize the impact of Amdahl’s law. 
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Figure 2: Prediction stencil nodes to compute values at nodes P within a given quadtree. The 
stencil comprises nodes belonging to four different quadtrees: Qi, i = 1,2,3,4. 


The data structure implemented in this work to support quadtrees and octrees, as well as the 
computation of the key multiresolution operations are detailed in what follows. 

3.1.1 Data structure and implementation of trees 

Without loss of generality we consider again the computational domain Lt = [0,1]^, d = 2,3. The 
dyadic refinement of cells allows us to define the coordinates of a given node, (xi, ..., Xt), i = l,d, 
using the binary numeral system. That is, for each dimension i, the cci-coordinate of a given 
node P at the grid-level j is given by Xi = O.Xi^iXi ^2 ■ ■ ■ Xtj-iXij, where Xij-i is equal to 0 or 1 
depending on which of the two possible nodes at grid-level j — 1 generated the node P; and the 
same follows for Xij- 2 ^ up to Using this convention one can easily define a space-filling curve 
[5S] throughout all nodes of the tree; a powerful tool already employed in multiresolution lisa 
and AMR [31 [52 applications. In our particular case we use a Morton order space-filling curve 
|45j . also known as Z-order or Morton code. For instance, in two dimensions, for a given node 
(xi = 0.xi,ia;i.2 • • • xij, X 2 = 0.a;2,ia:2,2 • • • X 2 ,j), its Morton abscissa is constructed by alternating 
the digits of Xi and X 2 '. 

S = Xi^iX2,lXi^2X2,2 ■ ■ ■ XijX2,j', 

and the same follows in three dimensions. In this way each node of the tree is uniquely defined 
by an integer. For each node of the tree we can then store in a 64-bit integer, its abscissa s and 
its corresponding grid-level j. In our implementation the first 48 bits are intended to contain the 
abscissa and the following four, the grid-level, leaving the remaining bits available for tagging pur¬ 
poses during the multiresolution operations, as shown in Figure In the case of three-dimensional 
problems, the latter choice allows one to encode nodes of trees with up to 16 grid-levels. 


abscissa 
(48 bits) 


level free 

(4 bits) (12 bits) 


Figure 3: Representation of a node using a 64-bit integer. 

This representation of nodes is similar to the so-called CSAMR data structure introduced in 
|41j . which then uses hash tables for efficient node lookup. In our implementation we also employ 
hash tables, but over clustered nodes, in what follows denoted as blocks, which enables further 
control on lookup granularity, and eases the operations for shared-memory parallelism. Taking into 
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account that r = O.s belongs to [0,1[, we partition [0,1[ into intervals. The data structure is thus 
given by blocks and collections of blocks. A block is then a structure that contains data related to a 
given interval, namely, the interval bounds: Smin and Smax! a vector containing all nodes such that 
s belongs to [smin) Smax[; and the size of the vector. A collection of blocks is given by pointers to the 
blocks, ordered according to their Smin values. In practice, based on preliminary tests, we operate 
with two collections of blocks, one for the leaf nodes and another one for the non-leaf nodes. 

Abscissas are not necessary ordered inside blocks, and binary search is used to locate the block 
where the abscissa of a given node is possibly stored, followed by a sequential search within that 
block. In our implementation the latter strategy has proven to be the most efficient, in particular 
compared with one based on a fully ordered storage. A system of caches keeps track of the last 
blocks accessed, effectively decreasing the number of binary searches. These caches are local to 
TBB tasks. 

Additionally, five auxiliary operations are implemented to properly handle this data structure. 
Two of them aim at maintaining the size of the blocks within a predefined range, by splitting 
into two a relatively large block or merging two neighboring blocks of relatively small size. The 
complexity of the search is hence of 0(log2 Nt,), where Nf, is the number of blocks in the collection. 
The next two operations are used to enlarge or reduce the size of the vectors contained in the 
blocks. These are performed by allocating a new vector of suitable size, copying the content of the 
original vector, and finally, deleting the original vector. The fifth operation is implemented as a 
garbage collector, for instance, to effectively delete previously tagged nodes. Notice that all these 
five operations are such that can be performed in parallel, by using the parallel_for structure of 
the TBB library while looping over blocks. 

Finally, the refinement process entails adding the abscissas of the new nodes at the end of the 
vector of the block that contains the abscissa of the original node. On the contrary, the coarsening 
process is performed using a lazy approach; that is, all nodes that need to be deleted are first 
tagged, using the free space in their 64-bit representation, while in a second stage the garbage 
collector effectively remove them from their corresponding blocks. Further details are given in the 
following. 

3.1.2 Parallel implementation of the key multiresolution operations 

Grid adaptation in the multiresolution algorithm is essentially driven by the value of details ([^, 
which in turn is computed using the projection and prediction operators. In practice, details are 
computed in two stages, as follows. 

I. Projection. Data is propagated from the leaf nodes to the root of the tree, an operation 
easily cast as a recursive computation. Based on preliminary tests, we came out with the 
following implementation. Identifying a relatively low, fully populated level, jmim two recur¬ 
sive computation are performed. First, for every node P at grid-level jmin, data is recursively 
transferred from the corresponding leaves down to P. These are independent computations, 
hence performed in parallel by using recursive parallel_for constructions. In a second step, 
a sequential computation propagates recursively data from level jmin to the root of the tree. 
In this work, jmin was taken equal to 3 and 5, respectively, for three- and two-dimensional ap¬ 
plications. It is worth noticing that often the sequential computational, although inexpensive, 
needs not be performed, as in general no grid adaptation is present at those low grid-levels. 

II. Prediction and detail computation. Details are computed for every node P according to 
([^, after applying the prediction operator by means of interpolation, in the following schematic 
manner. 

for P G {Nodes} do 

Define the list L of Nodes in the prediction stencil associated with P (see, e.g., Figure 

i; 


Get values of Nodes in L; 
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Compute the predicted value and the detail; 

Store the detail. 

end for 

All these computations are easily implemented within a parallel_for structure, looping over 
blocks. 

Once the details have been computed, the adapted grid is updated, also in parallel, in two stages 
by means of the refinement and coarsening procedures, detailed in what follows. Notice that in all 
cases a lazy strategy was again implemented. 

I. Refinement. Based on the detail values, nodes that need to be refined are first tagged 
in parallel. Then, all tagged cells are refined. The resulting tree will not necessary be a 
graded one. Therefore, additional nodes must be tagged for refinement in order to guarantee 
a graded tree-structure. The process of tagging and refining is thus repeated until a graded 
tree-structure is achieved. 

The task-based refinement operation is implemented in the following way. Each available task 
considers a given set of consecutive blocks, that is blocks of nodes within a given abscissa 
interval, say Whenever a node is created by refinement, it is directly stored in the 

corresponding block that contains its abscissa, as previously explained. However, if necessary, 
two new blocks are created by each task to temporarily store those new nodes whose abscissas 
are not contained in s/[ (one for abscissas smaller than Si, and another one for those larger 
than or equal to s/). These operations can be safely done in parallel because no task will 
be writing into any set or temporary block belonging to another task. Once all tagged nodes 
were refined, a new task-based operation starts, again over consecutive sets of blocks, to move 
the previously created nodes from the temporary blocks to the corresponding block actually 
containing its abscissa. The latter process can be also safely performed in parallel because it 
only considers read-only accesses. 

II. Coarsening. At this point, leaf nodes that are no longer necessary can be eliminated. The 
latter applies to leaf nodes whose parent-nodes were not tagged for refinement in the previous 
stage, and that can be deleted without compromising the graded tree-structure. 

The task-based coarsening operation is implemented in the following way. Each available task 
considers a given set of consecutive blocks, and tags the leaf nodes that need to be deleted. 
Once the tagging phase is over, parent-nodes of tagged nodes are moved to the collection 
corresponding to leaf nodes, using the same task-based techniques described for the refinement. 
Einally, a parallel call to the garbage collector effectively deletes the tagged nodes. These three 
operations are repeated until no node deletion is needed. Again these computations are safely 
performed in parallel using parallel_f or over blocks. 

3.2 Parallel implementation of the splitting solver 

Before considering the parallel implementation itself, we have to define the way of storing the 
unknowns of system Q in memory. This set of unknowns needs to be solved in every leaf node of 
the tree, and there are basically two options to store them in memory, following either an array of 
structures or structure of arrays approach. The first option consists in storing all data related to a 
given node, in particular the unknowns, in a common structure, thus achieving node-wise memory 
contiguity. On the other hand, the second option consists in storing each unknown in its own array, 
thus achieving variable-wise memory contiguity. 

In general, within a splitting framework, the first layout turns out to be the most appropriate 
for the reaction solver, as it computes local reaction rates using all the unknowns at the current 
node. The diffusion problem, on the other hand, is solved for each unknown independently, hence 
a structure of arrays would provide better data locality. Nevertheless, the reaction solver is highly 
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compute-intensive and therefore, largely insensitive to the memory layout, whereas the diffusion 
step displays very low compute-intensity. Consequently, we have opted to store each unknown in its 
own contiguous array. As we will later see, a large fraction of the runtime is spent in the reaction 
and diffusion steps; having such a memory layout proves then to be very relevant, especially when 
the variables are not coupled in the diffusion operator. The Morton order defines the order of the 
unknowns, which are stored in vectors. Additionally, the sparse matrices used during the diffusion 
steps can be stored according to a classical CSR storage [IS]. However, an even more compact format 
was developed in this work for diffusion problems with constant diffusion coefficients (see details in 
Appendix . 

As previously mentioned, the parallel implementation of the reaction solver is quite straightfor¬ 
ward. Once the access to the values of the unknowns rt at a given node is defined, the reaction solver 
operates over the N leaf nodes present in the adapted grid as follows. The reaction solver class is such 
that computes the solution for a given number of nodes, defined as a blocked_raiige<size_t> within 
the TBB framework. Then, a call to parallel_for(blocked_rajige<size_t>(0,N) .Reaction) par¬ 
allelizes the computation, where Reaction corresponds to the solver class. 

For the diffusion solver, two levels of parallelism is implemented. Other the aforementioned 
parallel computation of the m unknowns in system 0. the matrix-vector products performed 
by ROCK4 are also parallelized, using two parallel_for constructions. Further details on this 
ROCK4 implementation is described in what follows 


3.2.1 Implementation of ROCK4 


ROCK4 is an explicit time integration method, which can be viewed as the composition of two 
methods that are successively performed [T]. The first one is based on orthogonal polynomials and 
uses a three-term recurrence formula where the number of function evaluations, hence stages of the 
method, depends on the spectral radius of the Jacobian of the system. The second one is an explicit 
four-stage Runge-Kutta scheme. For a system dy/dt = f{y) and a time step 6 t, the latter is given 

by 

i-1 



I — 1, 2, 3,4, 


i=l 


(5) 


where aij and bi correspond to the coefficients of the Runge-Kutta scheme. 

Computing linear combinations of large vectors of double type, as in ([^, is characterized 
by a rather low arithmetic intensit-J^ less than 1/4, because of the multiple memory accesses 
associated with reading and writing temporary arrays in memory. Nevertheless, achieving a high 
arithmetic intensity is fundamental to obtain a high FLOP/s efficiency (see, e.g., |S3]). In our 
implementation, the arithmetic intensity is largely improved by exploiting the fact that an explicit 
Runge-Kutta scheme applied to a linear problem dy/dt = Ay is given by yn+i = '^{5tA)yn, where 
TT is a polynomial whose coefficients are a function of the a^j ’s and fc^’s coefficients [35], which 
are known and precomputed. In this way, we use the Horner’s method to compute the four-stage 
Runge-Kutta formula, thus reducing considerably the aforementioned memory traffic. Notice that 
a factorization of tt is not feasible because this polynomial has complex roots. Finally, the spectral 
radius p{A) of A, used to determine the number of stages of the method, is estimated using the 
Gershgorin circle theorem. 


^^The arithmetic intensity is a measure of floating-point operations (FLOPs) performed by a given code (or code 
section) relative to the amount of memory accesses (Bytes) that are required to support those operations. 
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4 Code performance and scalability 

We have developed a C++ code, named Z-code, which operates on multi-dimensional configura¬ 
tions; the dimension d is a template parameter for most classes and methods. It runs successfully 
on modern Linux systems, and has been tested with both gcc (version 4.8 and higher) and the Intel 
C++ (version 14.0.1 and higher) compilers. Figure]^ shows some results for the BZ problem. 



Figure 4: BZ model, two-dimensional simulation over 9 grid-levels. Variable ui (left) with the 
corresponding adapted grid defined at different grid-levels (right). 

Regarding the methods considered for the dedicated splitting solver, codes for the Radau5 and 
ROCK4 methods are publicly available on the Internet [M]. These codes, albeit very efficiently 
coded in Fortran 77, can be substantially improved. For RadauS we have carefully rewritten its 
implementation in C++, thus removing a significant amount of branching. The latter yields an 
enhanced performance between 20 to 30%. In the case of ROCK4, we have developed a com¬ 
pletely new implementation, efficiently adapted to linear problems, that reduces memory accesses, 
as described befortf^ 

Numerical performance can be measured from many points of view. In what follows we first 
present some comparisons between computations on uniform Cartesian and multiresolution adapted 
meshes. We then focus on the scalability attained for a given problem with increasing thread count 
(strong scaling). The following tests were performed on an Intel E5-2650 v3 platform (2 Haswell 
CPUs with 20 cores in total and 40 threads with hyperthreading) and on a Xeon Phi (MIC) SHOP 
(60 cores, 240 threads with simultaneous multithreading, and 8 GB of main memory) using the 
Intel icpc compiler. 

4.1 Uniform Cartesian meshes versus multiresolution computations 

Grid adaptation results in significant data compression. However, it is worth verifying whether 
the latter effectively leads to performance enhancement in terms of runtime, given the additional 
computations required to dynamically adapt the grid and the more elaborate implementation. 

Nevertheless, in order to guarantee a fair comparison, we have replaced the diffusion implemen¬ 
tation in the Z-code with one more suited to uniform meshes, for all runs performed on a uniform 
grid. Specifically, the matrix-vector products were replaced with a more efficient procedure for five- 
and seven-point stencils, using two- and three-dimensional arrays, respectively, and cache blocking. 
The latter avoids the additional cost of manipulating more complicated data structures and reduces 
the bandwidth of the computation. The following results were obtained on the Haswell machine, 
using 40 threads over 20 cores. 

^®The sources of Z-code and related software can be provided by contacting T. Dumont, the corresponding author 
of the present contribution. All codes are distributable under CeCILL-B licence (http: //www. cecill. inf o/licences. 
en.html). 
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Recalling that J stands for the maximum grid-level allowed for mesh refinement, Tables and 
gather the wall-clock computing times in seconds, for various grid-levels J and the aforementioned 
three problems in two and three dimensions. Additionally, we can define a compression ratio in 
percentage, Ci?, as 

CR=il - ^ ) X 100, 

V Nj) 

where N and Nj correspond to the number of nodes in the adapted grid and at the finest level, 
respectively. In particular Nj is given by 2'^^ and is equal to the number of nodes in the corre¬ 
sponding uniform Cartesian mesh. During all these computations, the compression ratio remained 
between 80 % and 83 %. 



J 

8 

9 

10 

NAGUMO 

MR 

CM 

2.51 X 10-^ (1.7) 
1.49 X 10-3 

3.02 X 10-3 

8.27 X 10-3 (2.7) 

4.97 X 10-3 

2.09 X 10-2 (4 2) 

BZ 

MR 

CM 

1.14 X 10-2 

1.31 X 10-2 (^ 2) 

2.51 X 10-2 

4.06 X 10-2 g) 

4.06 X 10-2 

1.78 X 10-1 (4 4) 

STROKE 

MR 

CM 

2.4 X 10-2 

3.2 X 10-1 ^23.3) 

4.03 X 10-2 

1.21 (30.0) 

1.03 X 10-1 

4.80 (46.6) 


Table 1: Wall-clock times in seconds for two-dimensional multiresolution (MR) and uniform Carte¬ 
sian mesh (CM) computations. Largest values are indicated in bold with the corresponding ratio 
in parenthesis. 



J 

8 

9 

NAGUMO 

MR 

CM 

0.13 

0.54 (2.5) 

0.31 

2.47 (4.5) 

BZ 

MR 

CM 

0.97 

3.05 (3.1) 

5.75 

23.60 (4.1) 

STROKE 

MR 

CM 

1.53 

76.68 (50.0) 

10.81 

610.50 (56.5) 


Table 2: Wall-clock times in seconds for three-dimensional multiresolution (MR) and uniform Carte¬ 
sian mesh (CM) computations. Largest values are indicated in bold with the corresponding ratio 
in parenthesis. 

These results clearly highlight the advantage of using the multiresolution algorithm to solve stiff 
problems disclosing localized reaction fronts with an appropriate grid resolution. As a matter of 
fact, looking at the previous results we see that computations on a uniform grid are more efficient 
only for the two-dimensional, non-stiff NAGUMO problem on a rather coarse grid. In general the 
additional cost related to mesh adaptation and more complex data structures is counterbalanced 
by the computing savings achieved with a compressed representation. The latter is even more 
relevant for problems with very stiff reaction terms, conhrming the findings and forecasts previously 
described in Appendix |A| 
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4.2 Analysis of scalability 

We investigate the strong scaling behavior of Z-code by timing one single simulation time step, 
and breaking down the total step time (S) into the mesh adaptation (A), reaction solver (R), 
and diffusion solver (D). Notice that the adapted grid is updated at every time step, the detail 
computation is hence included in the mesh adaptation process, and the diffusion solver must also 
construct the matrices used in the linear systems. 

The three reference problems probe different regimes of compute-intensity. The NAGUMO prob¬ 
lem is characterized by a rather inexpensive, non-stiff reaction term, whereas the computing effort 
for the BZ problem is roughly balanced between the reaction solver and the other two procedures; 
that for the STROKE problem, on the other hand, is completely dominated by a very expensive, 
stiff reaction term. Table shows the percentages of wall-clock time spent on each of the three 
major code phases, for three-dimensional simulations performed over 10 grid-levels (J = 10) and 
40 threads on the Haswell platform. 



NAGUMO 

BZ 

STROKE 

Mesh Adaptation 

68.00 

19.23 

7.20 

Reaction Solver 

0.92 

56.23 

88.89 

Diffusion Solver 

31.08 

24.54 

3.91 


Table 3: Percentages of wall-clock time for three-dimensional simulations with maximum grid-level 
J = 10 and 40 threads on the Haswell platform. Dominant contributions are indicated in bold. 

We now discuss scalability in terms of parallel efficiency. Let Tk be the wall-clock computing 
time in seconds when running on k threads. We define the parallel efficiency on k threads by 

rr, ) 

nk-Lk 

where Uk corresponds to the number of cores used when computing over k threads. In particular 
because of the simultaneous multithreading, that is, running more than one thread per core, is 
taken either as 

Uk = min(A:, 20), 

for the Haswell platform, or 

rife = min(A:, 60), 

for the Xeon Phi. In this way we compute the efficiency based on the number of CPU cores, not 
the number of available hardware threads. The motivation for this choice is that cores, unlike 
hardware threads, constitute independent compute units. The scalability plots therefore present 
scaling results as a function of independent hardware resources, i.e. cores, participating in the 
computation. An important consequence is that, when using hyperthreading, efficiencies may well 
exceed one. This simply means that running more than one thread per core may result in a more 
efficient execution at the core level than what can be achieved with a single thread. We will discuss 
hyperthreading in more detail in the following. Figure [^illustrates the parallel efficiency, Sk, for all 
three problems, on Haswell and Xeon Phi, in three-dimensional simulations. 

Overall, the code achieves good shared-memory scalability on compute-intensive problems (BZ 
and STROKE), with efficiencies over 80% over 2 Haswell sockets. The scaling of the reaction step 
is very good mainly because of its embarrassingly parallel nature and the dynamic load balancing 
provided by TBB; there is actually no shortage of parallelism in this part of the time step. In 
addition, since the reaction step is strongly compute bound, cores do not compete for shared 
resources, allowing for linear scaling. Only for the NAGUMO problem, whose reaction computation 
is quite inexpensive, does the reaction efficiency drop below 90%. 
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NAGUMO, level J= 10, Haswell 



. 1 socket 2 sockets 2 sockets with hyperthreading 

12 4 8 10 20 4 

Number k of TBB worker threads 



. 1 socket 2 sockets 2 sockets with hyperthreading 

12 4 8 10 20 4 

Number k of TBB worker threads 

STROKE, level J = 9, Haswell 



Number k of TBB worker threads 





Figure 5: Parallel efficiency Sk in terms of number k of TBB worker threads for the NAGUMO (top), 
BZ (middle), and STROKE (bottom) problems on Haswell (left) and Xeon Phi (right) architectures, 
for mesh adaptation (A), diffusion (D), reaction (R), and total time step (S). The blue-shaded areas 
highlight thread counts with more than one TBB worker thread per CPU core (2 threads/core on 
Xeon Haswell and 2-4 threads/core on Xeon Phi). 


The diffusion and adaptation steps achieve lower parallel efficiencies than the reaction one. A 
first reason for this is that both phases are more memory-intensive than the reaction; therefore, 
more pressure is put on the memory controllers, which are shared between cores of a given socket. 
This results in less-than-ideal scaling, as memory accesses become a bottleneck. Note that for the 
diffusion operator, we have used the compact structure described in Appendix which effectively 
improved performance over a classical CSR structure for the problems studied here. Other the 
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considerations on memory access, parallelization of the diffusion solver and multiresolution opera¬ 
tions is intrinsically more complex than that of the reaction solver, yielding slightly less exposed 
parallelism and a degradation of parallel performance due to Amdahl’s law. This highlights the 
importance of exposing as much parallelism as possible, in particular in complicated algorithms 
such as mesh adaptation via multiresolution. 

We now discuss the impact of simultaneous multithreading (hyperthreading) on the performance 
of Z-Code on the Haswell Xeon and Xeon Phi platforms. Hyperthreading allows the cores’ execution 
units to be used by a second application thread whenever the core stalls executing an instruction 
stream from the first thread. Typically, it is beneficial for applications that have good thread 
scalability, and where the execution is frequently stalled by latency-related events such as cache 
misses, branches, or bad pipelining. 

For Z-Code on Haswell, moving from 1 to 2 threads per core (20 to 40 TBB worker threads) pro¬ 
vides significant performance improvements, especially in the reaction solver where hyperthreading 
gains up to 1.25 speedup, with a geomean across problems of 1.20. The reaction solver features a 
very complex control flow in the Radau5 algorithm, together with BLAS function calls and op¬ 
erations on small matrices. All these contribute to core stalls due to bad instruction pipelining 
and complex branching. These latencies, together with the good thread scalability of the reaction 
solver, explain the signihcant performance gains attained by hyperthreading. 

By contrast, still on Haswell, the gains are more modest for the multiresolution algorithm 
(geomean 1.13) and even more so for the diffusion solver (geomean 1.05). The diffusion step features 
simpler control flow and, as previously discussed, its performance is dominated by memory controller 
bottlenecks, which hyperthreading cannot alleviate because they lie outside of the CPU core. 

In regard to the Xeon Phi, on our Knights Corner coprocessor, simultaneous multithreading is 
crucial to enhancing performance. Specihcally, it allows hiding some of the execution latencies that 
cannot be overlapped by the in-order architecture. In addition, because the Xeon Phi features two 
pipelines per core, it cannot reach its peak performance at less than 2 threads per core. Just like 
on the Haswell architecture, we hnd the largest performance gains on the reaction solver, with a 
geomean speedup of 1.80 across problems between 60 and 240 threads (1 and 4 threads/core). On 
Xeon Phi, however, hyperthreading is also beneficial to multiresolution mesh adaptation (geomean 
1.32) and diffusion (geomean 1.33), highlighting the importance of also overlapping the latencies of 
memory accesses. 


5 Concluding remarks and prospects 

The main purpose of the present work was to introduce shared-memory parallelism in a thoughtful 
way, to a tailored numerical strategy previously developed to tackle the simulation of stiff reaction- 
diffusion models disclosing propagating fronts |27j . In order to achieve this, a new data structure 
was conceived along with parallel-friendly implementations of the key algorithms into play, including 
the ROCK4 and Radau5 solvers. Shared-memory parallelism was introduced using the task-based 
TBB runtime library, which turned out to be particularly well suited to our purposes. 

The numerical performance of the resulting implementation was assessed for three reaction- 
diffusion models, reaching a very satisfactory level of efficiency on shared-memory architectures 
such as the Intel E5-2650 and the Xeon Phi (MIC) 5110P. For models involving a certain level of 
modeling complexity such as BZ or STROKE, high scalability is achieved mainly because of the 
very efficient performance of the reaction solver within a splitting context. In addition to that, the 
present parallel implementation of the diffusion solver and the multiresolution algorithm shows also 
to perform reasonably well, given the limitations commonly encountered in standard computing 
architectures, as previously discussed. Einally, simultaneous multithreading is shown to improve 
considerably the computational performance, in particular on many-core architectures. 

The proposed strategy is thus suitable for the simulation of multi-dimensional time-space multi¬ 
scale problems, which would be out of reach using classical approaches on standard computing 
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resources. In particular these reaction-diffusion problems are representative building blocks of more 
complex and realistic applications, encountered, for instance, in biomedical engineering |28) . com¬ 
bustion [5^, or plasma physics [23]. In this context ongoing work encompasses the extension to 
arbitrary domains, taking into account, for example, that the STROKE model applies to the hu¬ 
man brain |28j : and to more general diffusion problems, as used, for example, in multi-component 
transport modeling for combustion problems m- A key extension of the present work deals with 
the implementation of hybrid parallelism using both shared- and distributed-memory formalisms. 
In this regard we believe the implementations and findings of this work are relevant to distributed- 
memory codes as well. Achieving strong scaling is particularly challenging for multiresolution and 
AMR codes, because of load imbalance and fine-grain communications. For some applications, 
shared-memory parallelism with dynamic load balancing may allow distributing work evenly be¬ 
tween the CPU cores within an MPI rank, thereby helping reduce total imbalance. We thus expect 
shared-memory parallelism to gain in relevance for complex and imbalanced MPI applications. 
These are topics of our current research. 
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A Numerical experimentation on uniform Cartesian meshes 

Let us consider the numerical performance of the dedicated Strang scheme introduced in §2.1[ to 
solve 0 on a uniform Cartesian grid. Table shows the percentages of total wall-clock time spent 
in the diffusion step of the splitting strategy. 


Grid size 

NAGUMO 

BZ 

STROKE 

5123 

92.5 

34.5 

1.6 

10242 

88.5 

20.1 

1.0 


Table 4: Percentages of wall-clock time employed to solve the diffusion problem. 

Notice that for the models of higher complexity, that is, BZ and STROKE, most of the simulation 
time is spent in the reaction step. Consequently, as a measure of the computational complexity, 
we use the number of evaluations of the right-hand side performed by the RadauS program on a 
given grid-node during the i?At/ 2 "Step. This complexity measure may considerably vary throughout 
the computational domain, given that the splitting time step At/2 can be further split according 
to the time-stepping procedure and that an iterative, simplified Newton method is implemented 
to solve the nonlinear systems. In particular the minimum complexity corresponds to grid-nodes 
where only one reaction sub-step is required during a given At/2, as well as one iteration of the 
Newton method. 

Considering a uniform mesh of 1024^ grid-nodes. Figures and show the computational 
complexity measured for a given At/2 for the BZ and STROKE models, respectively. For the BZ 
problem the Jacobian is computed analytically involving three right-hand side evaluations. In this 
case 87% of the CPU time is spent over regions where the complexity is less than 17. These are 
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Complexity 



Figure 6: BZ problem. Histogram of computational complexity (left). In the right picture, white 
zones correspond to points where the complexity is greater than 17. 



Complexity 



Figure 7: STROKE problem. Histogram of computational complexity (left). In the right picture, 
white zones correspond to points where the complexity is greater than 80. 


the regions where the solution is close to the reaction equilibrium and therefore, the complexity 
remains also close to its minimum value. Similarly, about 90 % of the reaction CPU time is spent 
in regions where the complexity is less than 60 for the STROKE model. Here the Jacobian is 
computed numerically. 

B Choice of TBB for multiresolution applications 

In this work we have initially focused our attention on OpenMP API and TBB, both well-established 
options for shared-memory parallelism. 

The OpenMP standard is a widely supported and popular choice, especially for high-performance 
computing (HPC) codes. It relies on compiler directives to introduce parallel regions and work¬ 
sharing constructs and tasks, which are executed in parallel by a runtime library component. 
OpenMP has been supporting tasking since version 3, and has recently introduced more sophisti¬ 
cated task constructs with version 4, with the taskgroup and depend constructs. 

A key design characteristic of the OpenMP API is the reliance on a thread-based approach. 
That is, parallelism is defined explicitly by parallel regions which enforce parallel execution, and 
work and tasks are scheduled across threads participating in their containing parallel region. The 
set of threads taking part in a region is determined when entering the region, and fixed for its entire 
duration. Unfortunately, this severely limits opportunities for exploiting nested parallelism with 
OpenMP, even though the standard supports nested parallel regions. Actually, due to the fact that 
threads are assigned to a region for its whole duration, there can be no dynamic load balancing 
or work-sharing across parallel regions. In addition, creating an OpenMP region requires knowing 
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about the outside parallel execution context. That is, since OpenMP parallelism is mandatory, 
nesting parallel regions may lead to exponential creation of OpenMP worker threads for some 
runtime implementations, resulting in oversubscription. Finally, since all threads of a parallel region 
are required to enter an OpenMP work-sharing construct, one cannot arbitrarily nest OpenMP tasks 
and work-sharing constructs. In particular one cannot introduce parallelism by nesting omp for 
constructs within task constructs. 

Because of the aforementioned points, OpenMP parallelism is not composable, that is, it is 
not possible to expose parallelism in an opportunistic way, regardless of the outer context from 
which a piece of code may be called. For the development of a C-|—I- multiresolution code, we 
found composability to be a desirable property, as it allows introducing parallelism at any level 
in the code, without knowledge or constraint on the outside calling context. This is particularly 
important for codes with complex call graphs and multiple call paths, as often found in C-I--I- when 
using classes for code reuse. 

Intel TBB, on the other hand, relies on a different approach to parallel programming. It is an 
open-source C-I--I- template library for task parallelism, and requires no special support in the com¬ 
piler. In addition to task-based parallel constructs, TBB also provides concurrent data structures 
and memory allocators. The library implements a composable task-based runtime, meaning that 
any C-|—h function can expose parallelism with (potentially arbitrarily nested) constructs such as 
parallel_for or parallel_reduce, irrespective of the calling context. Dynamic load balancing is 
achieved through a work-stealing runtime. Since both tasks and constructs such as parallel_for 
rely on the same task-based scheduler, TBB parallel idioms may be combined and nested arbitrarily, 
as long as the programmer avoids races through appropriate synchronizations. In particular, TBB’s 
composability allows us to achieve good scalability with the adaptive multiresolution algorithm (see 
§ 3.1.2) by recursively nesting parallel_for constructs to expose parallelism in a way that follows 
the tree structure. In addition to a composable task runtime, TBB provides a parallel algorithm 
library with support for parallel reduction and sorting. 

We chose TBB over an OpenMP implementation mainly because the former provides an efficient 
task runtime with dynamic load balancing, composability of parallel codes, and a set of parallel data 
structures and algorithms. Equivalent functionality could have been achieved using OpenMP tasks, 
but more effort would have been necessary to circumvent some of the aforementioned shortcomings. 


C A compact data structure for diffusion sparse matrices 

When the diffusion coefficients, ei{x) in Q, are constant, it is possible to further reduce the amount 
of memory necessary to store the matrix entries associated with the diffusion problem, leading to 
lower memory bandwidth requirements for the program. Notice that for a given node, the matrix 
entries defined by the spatial discretization of the Laplace operator can have three possible values, 
depending on whether the neighboring nodes belong to the same, upper or lower grid-level. We 
therefore use the data structure shown in Figure It consists of two arrays; the array A contains 
pointers to the beginning of lines in the second array K, where each line corresponds to one node 
in the adapted grid. The array K contains line descriptors, one for each node in the adapted grid, 
defined as follows. The first integer of a line descriptor is a 32-bit integer, containing the grid-level 
j of the current node, followed by the number of neighboring nodes at grid-level j , j -\- 1 , and j — 1, 
as shown in Figure The following integers then indicate the corresponding neighboring nodes. 
As an illustration, Figure shows a typical line descriptor. 

Table compares the wall-clock times obtained when computing one diffusion step with the 
classical CSR data structure and the aforementioned compact storage, for the BZ model. Simulations 
on the Haswell platform were performed using 20 cores and 40 threads, whereas for Xeon Phi, 60 
cores and 240 threads were considered. Computational gains are rather weak for two-dimensional 
simulations, for which the matrices remain relatively small. However, better performances are 
obtained with the compact structure when dealing with three-dimensional matrices. 
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Figure 8: Compact data structure for diffusion matrices. The right part shows the structure of a 
line descriptor. 


I « 


Figure 9: Illustration of a line descriptor for a given node n at grid-level j. 
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